Final Project AD 654 Marketing Analytics¶


Team Delta

Importing Libraries¶

In [1]:
# Import libraries (duplicates removed, grouped by package)
import itertools
import random
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import pycountry

from scipy.stats import shapiro, ttest_ind, norm
import statsmodels.api as sm
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.tsa.api import SimpleExpSmoothing, Holt

from sklearn import metrics, preprocessing
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from tabulate import tabulate

warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

Part 1 Park Accident Data Visualization¶


Tableau Public Link: https://public.tableau.com/app/profile/yusen.zhou/viz/DeltaProject_AD654/InjuryCauses

Tableau workbook also available as part of the assignment submission.

The first visualization, Injury per Year per Fix Port, is a bar chart showing the number of accidents reported each year, with color indicating whether the ride is portable or fixed. Injuries increased year over year from 1988 to 2001 and declined slightly after the 2001 peak, suggesting that park management improved overall ride safety. Although the majority of accidents occurred on fixed rides, the management team cannot ignore the safety of portable rides: the small number of cases may simply reflect how few portable rides are in operation. The second chart depicts the age and gender distribution. Most accidents were concentrated in the under-18 age group. Boys have a higher accident rate under the age of 7, but females account for a higher share, 57.23%, overall. Management should alert staff to pay special attention to children when operating rides and post warning signs on the more dangerous equipment.

According to the third plot, the Device Category bar chart, the six ride types with the most accidents are water slides, coasters, go-karts, spinning rides, water rides, and car & track rides. Reading this together with the fourth graph, the Accidents by Park pie chart: although water parks account for only 20.3% of total accidents, two of the six most accident-prone ride types are water attractions. The water park management team should establish additional rules to ensure visitor safety, such as requiring children to be supervised by an adult before entering a water slide or water ride. According to the Accident versus Park Responsibility line plot, approximately 5%-15% of all accidents are attributable to the park, including employee error, machine operation error, and mechanical malfunction. Amusement parks should conduct routine inspections of their rides to ensure they are in good working order and safe for visitors, and all park employees should be trained in safety procedures and know how to operate the rides properly.

Finally, the treemap shows the top-ranked injury types across all records. Most injuries appear to have been caused by the visitors themselves or by other visitors. For minor injuries, first-aid stations staffed by qualified personnel should be placed throughout the park to provide immediate medical attention. Amusement parks should also maintain an emergency response plan that outlines how to respond to major incidents and evacuate visitors in an emergency.
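The Tableau views above are simple aggregations of the accident data, so they can be cross-checked in matplotlib. Below is a minimal sketch of the first view (yearly accident counts split by fixed vs. portable) on a tiny synthetic frame; the column names `year` and `fix_port` are assumptions for illustration, not the actual columns of `park_accidents.csv`.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; harmless inside a notebook too
import matplotlib.pyplot as plt

# Tiny synthetic stand-in for the accident data (hypothetical column names)
acc = pd.DataFrame({
    "year":     [1999, 1999, 2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "fix_port": ["F",  "P",  "F",  "F",  "P",  "F",  "F",  "F",  "P"],
})

# Count accidents per year, split by fixed (F) vs. portable (P)
counts = acc.groupby(["year", "fix_port"]).size().unstack(fill_value=0)

# Stacked bar chart mirroring the "Injury per Year" Tableau view
counts.plot(kind="bar", stacked=True)
plt.title("Accidents per year, fixed vs. portable")
plt.xlabel("Year")
plt.ylabel("Accidents")
plt.tight_layout()
```

The same `groupby(...).size().unstack()` pattern would reproduce any of the color-split bar charts in the workbook.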

Part 2 Summary Stats¶


In [2]:
park_accidents = pd.read_csv("park_accidents.csv")

I. Accidents by State

In [3]:
state_counts = park_accidents.groupby("acc_state")["acc_id"].count().sort_values(ascending=False)
print(state_counts)
acc_state
NJ    5646
CA    3407
TX    3320
FL    1070
PA     269
MI     265
OK     154
IL     152
WI     111
CO     107
OH      81
NY      69
MD      66
KY      37
NV      31
IA      14
NC      14
MA      13
WV      11
CT      11
IN       8
WA       8
MO       6
RI       5
AR       3
VA       2
TN       1
AZ       1
LA       1
GA       1
Name: acc_id, dtype: int64

II. Injury Causes

In [4]:
injury_causes = park_accidents.pivot_table(index="device_category", values="acc_id", aggfunc="count")
sorted_causes = injury_causes.sort_values(by="acc_id", ascending=False)  # renamed to avoid shadowing built-in sorted()
print(sorted_causes)
                    acc_id
device_category           
water slide           3530
coaster               2748
spinning              1988
go-kart               1767
water ride            1163
cars & track rides    1025
aquatic play           465
other attraction       451
play equipment         403
pendulum               318
vertical drop          252
float attraction       187
wave device            178
inflatable             151
challenge activity      96
unknown                 87
alpine activity         41
trampoline              33
laser tag                1

III. Mean Age by Accident Category

In [5]:
mean_age_by_category = park_accidents.pivot_table(index="category", values="age_youngest", aggfunc="mean")
sorted_mean_age= mean_age_by_category.sort_values(by="age_youngest", ascending=False)
print(sorted_mean_age)
                                                    age_youngest
category                                                        
Illness or neurological symptoms                       38.382075
Illness: Seizure or LOC                                25.573171
Hyperextension or dislocation                          24.083333
Load/Unload: injured when vehicle moved                23.810000
Abrupt stop/drop/lurch                                 22.537037
Fall: patron fell off inner tube, mat or board         22.385816
Body pain (normal motion)                              22.277075
Load/Unload: scrape or stumble                         20.469062
Impact: hit wall or barrier at end of slide runout     19.470588
Restraint too tight                                    19.455882
Collision: patrons collided (participatory)            19.448649
Impact: person hit by ride                             19.298701
Impact: hit something in participatory attraction      19.051142
Injured by foreign object                              17.353774
Collision: operator-controlled vehicles                17.176056
Impact: vaginal or rectal injury                       15.800000
Collision: patron-controlled vehicles                  15.545954
Fall: patron fell from device (participatory)          15.545455
Impact: hit something within ride vehicle              15.482634
Injured in queue or exit                               15.198502
Employee injured                                       14.975610
Entrapment or pinch-point                              14.975410
Collision: patrons collided within vehicle             14.836364
Load/Unload: hit or pinched by restraint               14.054867
Impact: extremity hit something outside carrier        13.315789
Unscheduled stop                                       13.133641
Burn (includes friction burn)                          12.693182
Collision: go-kart crashed (no further descript...     12.142857
Collision: go-kart or bumper car hit stationary...     11.783410
Fall: in climb or play area                            11.777778
Other                                                  11.578947
Environmental issue                                    10.600000
Fall: patron fell from seat, but not carrier           10.384615
Fall: ejection/fall from ride                          10.281588
Seatbelt abrasion or bruising                           9.725490
Choking, water inhalation, suffocation                  8.821429
Electrical shock                                        7.766667
Derailment                                              6.977273
Equipment failure                                       4.515789
Unknown (not enough info)                               3.156627
Awkward landing                                         0.000000

IV. Top 10 manufacturers by total number of people injured on their rides

In [6]:
injured_by_manufacturer = park_accidents.groupby("manufacturer")["num_injured"].sum().sort_values(ascending=False)
top_10_manufacturers = injured_by_manufacturer.head(10)
print(top_10_manufacturers)
manufacturer
0                                  2936
In-house                           1167
Intamin AG                          795
WhiteWater West Industries Ltd.     769
Arrow Dynamics                      681
Pro Slide                           562
Zamperla                            511
Chance Manufacturing, Inc.          458
Wisdom Industries, Ltd.             415
Surf Coaster                        377
Name: num_injured, dtype: int64

Summary

According to the exploratory data analysis, New Jersey has the highest number of accidents (5,646), followed by California and Texas. Park management in these states should be particularly vigilant about safety measures and accident prevention.

Among device categories, water slides account for the highest number of accidents (3,530), followed by coasters and spinning rides. Park management should focus on improving the safety of these attractions (for example, by optimizing standard operating procedures) and implement additional precautions.

In terms of mean age by accident category, accidents involving "Illness or neurological symptoms" and "Illness: Seizure or LOC" predominantly affect older individuals, with mean ages of 38.38 and 25.57 years, respectively. Categories such as "Derailment," "Electrical shock," and "Choking, water inhalation, suffocation" involve younger individuals, with mean ages of 6.98, 7.77, and 8.82 years, respectively. Park management can therefore prioritize safety measures, staff training, and targeted communication to prevent specific types of accidents in specific age groups.

Lastly, the top 10 manufacturers with the highest number of injured people include both well-known companies and in-house builders. Park management should meticulously examine the safety and maintenance records of attractions from these manufacturers, assess whether additional safety measures are needed, or even consider replacing rides with ones from companies that have lower accident rates. Notably, while compiling these statistics we discovered that the largest share of injuries is associated with a manufacturer value of "0", which signifies missing data. Park management should take this issue seriously and investigate the source of these accidents, as they are connected to faulty rides.
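Since the "0" manufacturer label signifies missing data, one option for future analysis is to recode it as a genuine missing value so that pandas excludes it from grouped totals by default. A minimal sketch on synthetic data (the values here are illustrative, not drawn from the real file):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: "0" is a placeholder for an unrecorded manufacturer
acc = pd.DataFrame({
    "manufacturer": ["0", "Intamin AG", "0", "In-house"],
    "num_injured":  [2, 1, 3, 1],
})

# Recode the placeholder so pandas treats it as missing
acc["manufacturer"] = acc["manufacturer"].replace("0", np.nan)

# Grouped totals now skip the unknown-manufacturer rows by default
known_totals = acc.groupby("manufacturer")["num_injured"].sum()

# Share of records whose manufacturer is unknown
missing_share = acc["manufacturer"].isna().mean()
```

Keeping `missing_share` visible alongside the top-10 table makes the data-quality caveat explicit for readers.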

Part 3 Segmentation and Targeting¶


1. Importing and Exploring Dataset

In [7]:
# Importing ski hotel dataset
hotel = pd.read_csv("ski_hotels.csv")

# Dataset Info
hotel.info()

# Dataset Head
hotel.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 407 entries, 0 to 406
Data columns (total 24 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              407 non-null    int64  
 1   country                 407 non-null    object 
 2   resort                  407 non-null    object 
 3   hotel                   407 non-null    object 
 4   price (£)               407 non-null    int64  
 5   distance_from_lift_(m)  407 non-null    object 
 6   altitude (m)            407 non-null    int64  
 7   totalPiste (km)         407 non-null    int64  
 8   totalLifts              407 non-null    int64  
 9   gondolas                407 non-null    int64  
 10  chairlifts              407 non-null    int64  
 11  draglifts               407 non-null    int64  
 12  blues                   407 non-null    float64
 13  reds                    407 non-null    float64
 14  blacks                  407 non-null    float64
 15  totalRuns               407 non-null    float64
 16  link                    407 non-null    object 
 17  sleeps                  407 non-null    object 
 18  decSnowLow2020(cm)      407 non-null    object 
 19  decSnowHigh2020(cm)     407 non-null    object 
 20  janSnowLow2020(cm)      407 non-null    object 
 21  janSnowHigh2020(cm)     407 non-null    object 
 22  febSnowLow2020(cm)      407 non-null    object 
 23  febSnowHigh2020(cm)     407 non-null    object 
dtypes: float64(4), int64(8), object(12)
memory usage: 76.4+ KB
Out[7]:
Unnamed: 0 country resort hotel price (£) distance_from_lift_(m) altitude (m) totalPiste (km) totalLifts gondolas chairlifts draglifts blues reds blacks totalRuns link sleeps decSnowLow2020(cm) decSnowHigh2020(cm) janSnowLow2020(cm) janSnowHigh2020(cm) febSnowLow2020(cm) febSnowHigh2020(cm)
0 0 italy bardonecchia residence-tabor 550 unknown 1312 140 23 0 9 14 17.0 24.0 1.0 42.0 https://www.igluski.com/ski-resorts/italy/bard... unknown 70 170 60 200 47 177
1 1 italy bardonecchia residence-villa-frejus 561 unknown 1312 140 23 0 9 14 17.0 24.0 1.0 42.0 https://www.igluski.com/ski-resorts/italy/bard... unknown 70 170 60 200 47 177
2 2 bulgaria bansko hotel-mura 566 1100 935 70 24 1 7 16 5.0 8.0 1.0 14.0 https://www.igluski.com/ski-resorts/bulgaria/b... 140 0 70 5 75 5 77
3 3 bulgaria borovets hotel-samokov 574 75 1390 58 18 1 4 13 1.0 13.0 1.0 15.0 https://www.igluski.com/ski-resorts/bulgaria/b... 590 0 0 160 200 194 246
4 4 bulgaria bansko hotel-lion---bansko 596 800 935 70 24 1 7 16 5.0 8.0 1.0 14.0 https://www.igluski.com/ski-resorts/bulgaria/b... 386 0 70 5 75 5 77

2. Selecting only Numeric Variables for Clustering Analysis

In [8]:
hotel_cluster = hotel[['country','price (£)', 'distance_from_lift_(m)', 'altitude (m)','totalPiste (km)',
                       'totalLifts', 'gondolas', 'chairlifts', 'draglifts', 'blues', 'reds',
                       'blacks', 'totalRuns', 'sleeps', 'decSnowLow2020(cm)','decSnowHigh2020(cm)',
                       'janSnowLow2020(cm)', 'janSnowHigh2020(cm)','febSnowLow2020(cm)',
                       'febSnowHigh2020(cm)']].copy()  # .copy() avoids SettingWithCopyWarning on later in-place edits

3. Checking the dataset for missing values and irrelevant values ("unknown")

In [9]:
# Checking for missing value
hotel_cluster.isna().sum()
Out[9]:
country                   0
price (£)                 0
distance_from_lift_(m)    0
altitude (m)              0
totalPiste (km)           0
totalLifts                0
gondolas                  0
chairlifts                0
draglifts                 0
blues                     0
reds                      0
blacks                    0
totalRuns                 0
sleeps                    0
decSnowLow2020(cm)        0
decSnowHigh2020(cm)       0
janSnowLow2020(cm)        0
janSnowHigh2020(cm)       0
febSnowLow2020(cm)        0
febSnowHigh2020(cm)       0
dtype: int64

No missing values were detected, but the head and unique-value output show a number of "unknown" entries in the 'distance_from_lift_(m)' and 'sleeps' variables.

In [10]:
# Checking for "unknown" value for each variables
unknown_counts = []

for col in hotel_cluster.columns:
    if 'unknown' in hotel_cluster[col].unique():
        unknown_count = hotel_cluster[col].value_counts()['unknown']
        unknown_percent = (unknown_count / hotel_cluster.shape[0]) * 100
    else:
        unknown_percent = 0
    unknown_counts.append((col, unknown_percent))

table = tabulate(unknown_counts, headers=['Column', 'Unknown %'], tablefmt='grid')

print(table)
+------------------------+-------------+
| Column                 |   Unknown % |
+========================+=============+
| country                |      0      |
+------------------------+-------------+
| price (£)              |      0      |
+------------------------+-------------+
| distance_from_lift_(m) |     47.1744 |
+------------------------+-------------+
| altitude (m)           |      0      |
+------------------------+-------------+
| totalPiste (km)        |      0      |
+------------------------+-------------+
| totalLifts             |      0      |
+------------------------+-------------+
| gondolas               |      0      |
+------------------------+-------------+
| chairlifts             |      0      |
+------------------------+-------------+
| draglifts              |      0      |
+------------------------+-------------+
| blues                  |      0      |
+------------------------+-------------+
| reds                   |      0      |
+------------------------+-------------+
| blacks                 |      0      |
+------------------------+-------------+
| totalRuns              |      0      |
+------------------------+-------------+
| sleeps                 |     23.5872 |
+------------------------+-------------+
| decSnowLow2020(cm)     |     13.7592 |
+------------------------+-------------+
| decSnowHigh2020(cm)    |     13.7592 |
+------------------------+-------------+
| janSnowLow2020(cm)     |     13.7592 |
+------------------------+-------------+
| janSnowHigh2020(cm)    |     13.7592 |
+------------------------+-------------+
| febSnowLow2020(cm)     |     13.7592 |
+------------------------+-------------+
| febSnowHigh2020(cm)    |     13.7592 |
+------------------------+-------------+

Observed Variable with "Unknown" Values:

  • distance_from_lift_(m)
  • sleeps
  • decSnowLow2020(cm)
  • decSnowHigh2020(cm)
  • janSnowLow2020(cm)
  • janSnowHigh2020(cm)
  • febSnowLow2020(cm)
  • febSnowHigh2020(cm)

4. Checking the magnitude of "unknown" values in the dataset

In [11]:
a = hotel[['country', 'resort', 'hotel','distance_from_lift_(m)','sleeps']]
b = a[a['distance_from_lift_(m)'] == 'unknown']

# how many hotels in each country have an "unknown" distance-from-lift value
print(b['country'].value_counts())

print("")

# how many resorts in a specific country in total
print(hotel['country'].value_counts())

print("")

# percentage of resorts that have unknown distance to lift value in each country
print(b['country'].value_counts()/hotel['country'].value_counts())
austria     85
italy       59
france      25
finland     14
andorra      8
bulgaria     1
Name: country, dtype: int64

austria     169
italy       105
france       77
andorra      29
finland      16
bulgaria     11
Name: country, dtype: int64

andorra     0.275862
austria     0.502959
bulgaria    0.090909
finland     0.875000
france      0.324675
italy       0.561905
Name: country, dtype: float64

To build the clusters, it is better to exclude the "distance_from_lift_(m)" and "sleeps" variables. These two variables are specific to each individual resort and hotel, making them unsuitable for imputation. The snow variables, on the other hand, can be imputed because snowfall is a naturally occurring, resort-level phenomenon that can be estimated from the available data.

It is not recommended to drop the observations with "unknown" values, because they constitute a significant portion of the data for each country.

5. Dropping and Imputing Variables

In [12]:
# Dropping Variables
hotel_cluster.drop(columns=['distance_from_lift_(m)', 'sleeps'], inplace=True)
In [13]:
# loop over column names to remove "unknown" values and convert to numeric
cols = ['decSnowLow2020(cm)', 'decSnowHigh2020(cm)', 'janSnowLow2020(cm)',
        'janSnowHigh2020(cm)', 'febSnowLow2020(cm)', 'febSnowHigh2020(cm)']

snow = pd.DataFrame(columns=cols)

for col in cols:
    snow[col] = pd.to_numeric(hotel_cluster[col], errors='coerce')
    #snow = snow.dropna(subset=[col])

snow['country'] = hotel['country']

# Taking the observation mean for each variable for imputation value
print(snow.groupby('country').mean())
          decSnowLow2020(cm)  decSnowHigh2020(cm)  janSnowLow2020(cm)  \
country                                                                 
andorra            20.833333            47.777778           99.444444   
austria            16.944444            73.783951           51.524691   
bulgaria            0.000000            38.181818           57.272727   
finland            35.818182            50.454545           71.363636   
france             57.807018           117.526316          131.491228   
italy              67.500000           187.391304           50.347826   

          janSnowHigh2020(cm)  febSnowLow2020(cm)  febSnowHigh2020(cm)  
country                                                                 
andorra            170.833333           78.444444           151.666667  
austria            132.265432           51.679012           147.179012  
bulgaria           110.909091           85.818182           140.000000  
finland             77.727273           85.909091           100.454545  
france             240.070175          126.771930           235.368421  
italy              201.206522           58.630435           202.945652  
In [14]:
# Converting snow information in the hotel_cluster dataset into numeric
for col in cols:
    hotel_cluster[col] = pd.to_numeric(hotel_cluster[col], errors='coerce')
In [15]:
# Imputing Snow value for each of the missing value for each of the country
snow_mean = snow.groupby('country').mean()
for col in cols:
    for idx, row in hotel_cluster.iterrows():
        if row['country'] in snow_mean.index:
            if pd.isna(row[col]) or row[col] == 'unknown':
                hotel_cluster.loc[idx, col] = snow_mean.loc[row['country'], col]

# Removing the intermediary 'country' variable
hotel_cluster.drop(columns=['country'], inplace=True)
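The row-by-row loop above works, but the same per-country mean imputation can be written more compactly (and much faster) with `groupby(...).transform`. A minimal sketch on synthetic data, assuming the values have already been coerced to numeric with `NaN` for "unknown":

```python
import numpy as np
import pandas as pd

# Synthetic snow data with missing values, mirroring the structure above
df = pd.DataFrame({
    "country": ["austria", "austria", "italy", "italy", "italy"],
    "decSnowLow2020(cm)": [10.0, np.nan, 60.0, 80.0, np.nan],
})

# Fill each missing value with the mean of its own country's group;
# transform("mean") broadcasts the group mean back to every row
col = "decSnowLow2020(cm)"
df[col] = df[col].fillna(df.groupby("country")[col].transform("mean"))
```

This one-liner per column replaces the nested `iterrows` loop while producing the same imputed values.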

6. Scaling the Data

In [16]:
# Scaling the hotel_cluster dataframe by standardization and assigning it to new variable named scaled_hotel_cluster
scaler = preprocessing.StandardScaler()
scaled_hotel_cluster = scaler.fit_transform(hotel_cluster)
scaled_hotel_cluster = pd.DataFrame(scaled_hotel_cluster)

# Changing the colnames to its original value
scaled_hotel_cluster.columns = hotel_cluster.columns
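Standardization matters here because k-means is driven by Euclidean distance: a feature measured in large units (altitude in metres) would otherwise dominate features measured in small counts (gondolas). A quick self-contained check of what `StandardScaler` does, on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (e.g. altitude vs. gondola count)
X = np.array([[1312.0, 0.0],
              [ 935.0, 1.0],
              [1390.0, 1.0]])

scaled = StandardScaler().fit_transform(X)

# After standardization every column has mean ~0 and unit variance,
# so no single feature dominates the distance computation
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
```

Without this step, cluster assignments would be determined almost entirely by the altitude column.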

7. Constructing Elbow Chart

In [17]:
# Initializing empty list for SSE calculation
sse = {}
for k in range(1, 11):
# Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=296)
# Fit KMeans on the normalized dataset
    kmeans.fit(scaled_hotel_cluster)
    sse[k] = kmeans.inertia_
# Add the plot title "The Elbow Method"
plt.title('The Elbow Method')
# Add X-axis label "k"
plt.xlabel('k')
# Add Y-axis label "SSE"
plt.ylabel('SSE')
sns.pointplot(x=list(sse.keys()), y=list(sse.values()));

Looking at the elbow chart above, the SSE drops drastically from k=1 to k=2, while each further increase in k contributes only a small decrease. We inspect k=3 to k=5 in detail to check which number of clusters would be easiest for the conference audience to understand.
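Beyond eyeballing the elbow, the silhouette score (a complementary check not used in the original analysis) quantifies how tight and well-separated the clusters are, with values near 1 being best. A sketch on synthetic blobs rather than the hotel data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with three well-separated groups
rng = np.random.default_rng(296)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in (0, 5, 10)])

# Score each candidate k; higher silhouette => better-separated clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=296, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

Running the same loop over `scaled_hotel_cluster` would give a numeric second opinion alongside the cluster-profile tables below.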

8. Finding the Number of k

In [18]:
# Number of cluster equals to 3
kmeans3 = KMeans(n_clusters=3, random_state=296)
kmeans3.fit(scaled_hotel_cluster)
cluster_labels3 = kmeans3.labels_

k_means3 = scaled_hotel_cluster.assign(Cluster = cluster_labels3)
k_means3.groupby(['Cluster']).agg({
'price (£)': ['mean','std'],
'altitude (m)':['mean','std'],
'totalPiste (km)': ['mean','std'],
'totalLifts': ['mean','std'],
'gondolas': ['mean','std'],
'chairlifts': ['mean','std'],
'draglifts': ['mean','std'],
'blues': ['mean','std'],
'reds': ['mean','std'],
'blacks': ['mean','std'],
'totalRuns': ['mean','std'],
'decSnowLow2020(cm)': ['mean','std'],
'decSnowHigh2020(cm)': ['mean','std'],
'janSnowLow2020(cm)': ['mean','std'],
'janSnowHigh2020(cm)': ['mean','std'],
'febSnowLow2020(cm)': ['mean','std'],
'febSnowHigh2020(cm)': ['mean','std','count'],
}).round(2)
Out[18]:
price (£) altitude (m) totalPiste (km) totalLifts gondolas chairlifts draglifts blues reds blacks totalRuns decSnowLow2020(cm) decSnowHigh2020(cm) janSnowLow2020(cm) janSnowHigh2020(cm) febSnowLow2020(cm) febSnowHigh2020(cm)
mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std count
Cluster
0 -0.16 0.98 -0.24 0.97 -0.39 0.72 -0.42 0.60 -0.33 0.52 -0.40 0.67 -0.37 0.64 -0.42 0.40 -0.46 0.47 -0.50 0.52 -0.50 0.43 -0.26 0.78 -0.01 1.08 -0.34 0.61 -0.27 0.93 -0.36 0.64 -0.27 0.98 287
1 0.92 1.24 0.93 1.26 2.31 0.00 3.23 0.16 3.76 0.15 2.60 0.03 2.68 0.29 2.87 1.28 2.01 0.97 1.53 0.16 2.61 0.20 1.53 0.00 0.17 0.00 2.92 0.00 1.31 0.00 2.80 0.00 1.31 0.00 14
2 0.31 0.88 0.52 0.75 0.75 0.88 0.71 0.73 0.41 0.87 0.73 0.89 0.64 1.02 0.75 1.04 0.98 1.03 1.16 0.87 1.01 0.85 0.51 1.21 0.02 0.82 0.52 1.07 0.55 0.88 0.62 0.98 0.55 0.72 106

Looking at the result of the first trial with k=3, the data splits cleanly across the variables, suggesting three distinct clusters with clear differences between them.

In [19]:
# Number of cluster equals to 4
kmeans4 = KMeans(n_clusters=4, random_state=296)
kmeans4.fit(scaled_hotel_cluster)
cluster_labels4 = kmeans4.labels_

k_means4 = scaled_hotel_cluster.assign(Cluster = cluster_labels4)
k_means4.groupby(['Cluster']).agg({
'price (£)': ['mean','std'],
'altitude (m)':['mean','std'],
'totalPiste (km)': ['mean','std'],
'totalLifts': ['mean','std'],
'gondolas': ['mean','std'],
'chairlifts': ['mean','std'],
'draglifts': ['mean','std'],
'blues': ['mean','std'],
'reds': ['mean','std'],
'blacks': ['mean','std'],
'totalRuns': ['mean','std'],
'decSnowLow2020(cm)': ['mean','std'],
'decSnowHigh2020(cm)': ['mean','std'],
'janSnowLow2020(cm)': ['mean','std'],
'janSnowHigh2020(cm)': ['mean','std'],
'febSnowLow2020(cm)': ['mean','std'],
'febSnowHigh2020(cm)': ['mean','std','count'],
}).round(2)
Out[19]:
price (£) altitude (m) totalPiste (km) totalLifts gondolas chairlifts draglifts blues reds blacks totalRuns decSnowLow2020(cm) decSnowHigh2020(cm) janSnowLow2020(cm) janSnowHigh2020(cm) febSnowLow2020(cm) febSnowHigh2020(cm)
mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std count
Cluster
0 -0.15 0.97 -0.25 0.99 -0.34 0.75 -0.38 0.60 -0.27 0.60 -0.38 0.68 -0.34 0.64 -0.39 0.42 -0.40 0.52 -0.44 0.58 -0.45 0.47 -0.26 0.79 -0.21 0.75 -0.29 0.60 -0.45 0.56 -0.33 0.63 -0.45 0.57 280
1 0.22 0.89 0.89 0.44 0.64 0.83 0.73 0.48 -0.07 0.58 0.75 0.70 0.87 0.60 0.29 0.61 0.42 0.60 0.98 0.64 0.49 0.25 1.14 1.14 0.45 0.75 0.87 1.13 1.14 0.56 1.14 0.77 1.02 0.42 64
2 0.66 0.98 0.01 0.96 1.41 1.06 1.60 1.40 1.93 1.38 1.44 1.25 1.18 1.59 2.20 1.11 2.15 0.83 1.60 0.97 2.35 0.30 0.14 0.99 -0.49 0.49 0.92 1.49 0.16 0.82 0.76 1.51 0.20 0.78 46
3 -0.17 1.15 0.66 0.52 -0.63 0.14 -0.75 0.04 -0.52 0.29 -0.40 0.09 -0.87 0.03 -0.66 0.11 -0.82 0.04 -0.71 0.16 -0.82 0.06 -0.44 0.32 3.05 0.84 -0.93 0.15 2.67 0.63 -0.98 0.15 2.98 0.27 17

Looking at the result of the second trial with k=4, the distinction between clusters is not as obvious as with k=3: clusters 0 and 3 have quite similar characteristics and would need to be separated better to be clearly distinguished.

In [20]:
# Number of cluster equals to 5
kmeans5 = KMeans(n_clusters=5, random_state=296)
kmeans5.fit(scaled_hotel_cluster)
cluster_labels5 = kmeans5.labels_

k_means5 = scaled_hotel_cluster.assign(Cluster = cluster_labels5)
k_means5.groupby(['Cluster']).agg({
'price (£)': ['mean','std'],
'altitude (m)':['mean','std'],
'totalPiste (km)': ['mean','std'],
'totalLifts': ['mean','std'],
'gondolas': ['mean','std'],
'chairlifts': ['mean','std'],
'draglifts': ['mean','std'],
'blues': ['mean','std'],
'reds': ['mean','std'],
'blacks': ['mean','std'],
'totalRuns': ['mean','std'],
'decSnowLow2020(cm)': ['mean','std'],
'decSnowHigh2020(cm)': ['mean','std'],
'janSnowLow2020(cm)': ['mean','std'],
'janSnowHigh2020(cm)': ['mean','std'],
'febSnowLow2020(cm)': ['mean','std'],
'febSnowHigh2020(cm)': ['mean','std','count'],
}).round(2)
Out[20]:
price (£) altitude (m) totalPiste (km) totalLifts gondolas chairlifts draglifts blues reds blacks totalRuns decSnowLow2020(cm) decSnowHigh2020(cm) janSnowLow2020(cm) janSnowHigh2020(cm) febSnowLow2020(cm) febSnowHigh2020(cm)
mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std count
Cluster
0 0.32 0.88 0.69 0.57 0.65 0.78 0.76 0.49 0.08 0.60 0.89 0.73 0.76 0.71 0.58 0.88 0.79 0.98 1.31 0.94 0.86 0.79 0.90 1.14 0.21 0.82 0.84 1.02 0.85 0.79 1.02 0.75 0.80 0.59 80
1 -0.05 0.98 -0.31 0.92 0.09 0.85 0.04 0.37 0.17 0.68 0.01 0.49 -0.01 0.56 -0.01 0.70 0.04 0.78 -0.12 0.59 -0.01 0.68 -0.56 0.56 -0.35 0.63 -0.41 0.46 -0.44 0.49 -0.51 0.37 -0.43 0.52 179
2 -0.17 1.15 0.66 0.52 -0.63 0.14 -0.75 0.04 -0.52 0.29 -0.40 0.09 -0.87 0.03 -0.66 0.11 -0.82 0.04 -0.71 0.16 -0.82 0.06 -0.44 0.32 3.05 0.84 -0.93 0.15 2.67 0.63 -0.98 0.15 2.98 0.27 17
3 -0.24 0.97 -0.20 1.06 -0.86 0.27 -0.98 0.29 -0.73 0.47 -0.99 0.39 -0.84 0.32 -0.71 0.21 -0.80 0.22 -0.84 0.26 -0.86 0.22 0.14 0.88 -0.06 0.87 -0.14 0.74 -0.48 0.63 -0.08 0.83 -0.49 0.62 114
4 0.72 1.23 0.67 1.28 2.36 0.12 3.32 0.25 3.20 1.26 2.72 0.28 2.99 0.75 2.85 1.16 2.03 0.88 1.47 0.21 2.60 0.18 1.13 0.88 0.01 0.35 2.29 1.39 1.13 0.41 2.12 1.50 1.03 0.63 17

Looking at the result with k=5, the data splits nicely on some variables but is less distinctive on others, suggesting that some clusters share similar characteristics on one variable while differing sharply on others.

For the Lobsterland Conference, we recommend presenting the three-cluster result: its simplicity and easily understood characteristics make it more intuitive for a broad audience to grasp.

The five-cluster result could later be presented in a deeper-dive session for interested participants who want to study the additional clusters in order to maximize their profits.

9. Naming the Clusters

Cluster 0: The Alpine Haven

This cluster evokes a cozy, comfortable place amid the rugged beauty of the mountains. It has an average price compared to the other two clusters, the highest altitude, average total piste, average total lifts, average numbers of gondolas, chairlifts, and draglifts, average numbers of blue, red, and black runs, average total runs, the highest snow depth in December, and relatively deep snow in January and February.

It is nicknamed "Alpine" for its position high on the mountain, and "Haven" for its affordability combined with a decent number and variety of lifts, ample piste length, and superb snow depth throughout the winter season, making it a haven for ski lovers.

This cluster might be well suited for skiers with a medium to high budget who want to spend a bit more money to enjoy a solid skiing experience.

Cluster 1: The Slope Saver

This cluster is defined by affordability, suggesting a good choice for skiers on a budget. It has the lowest price, the lowest altitude, the lowest total piste, the fewest lifts of every kind (gondolas, chairlifts, draglifts), the fewest blue, red, and black runs, the lowest total runs, and the thinnest snow across all winter months.

Resorts in this cluster might offer a different kind of skiing experience, focused on the natural beauty of the surrounding valleys rather than high-altitude peaks, while making the most of thinner snow conditions by seeking out the best spots on the mountain.

This resort cluster is best suited for skiers who want to maximize their skiing months on a limited budget, or who want to stretch their budget by "ski-hopping" between resorts.

Cluster 2: The Glacier Elite

This cluster stands for reaching the highest levels of skiing ability, suggesting it might be the best choice for serious skiers looking to push themselves to the limit. Its high-end amenities and air of exclusivity also make it an ideal choice for skiers seeking a more upscale experience.

This cluster has the highest price, average altitude, the highest total piste length, highest total lifts, highest numbers of gondolas, chair lifts, and drag lifts, the highest numbers of blue, red, and black runs, the highest total runs, and average snow thickness during the winter months.

This resort cluster might be best suited for skiers with high-end budgets and a high level of skiing proficiency who are looking for a more serious challenge.
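These qualitative profiles are read off the per-cluster means of each feature. A minimal sketch of that summary step, using toy data in place of the clustered resort DataFrame (the column names mirror the dataset, but the values below are illustrative only):

```python
import pandas as pd

# Toy stand-in for the clustered resort data.
toy = pd.DataFrame({
    'Cluster': ['The Alpine Haven', 'The Slope Saver', 'The Glacier Elite'] * 2,
    'price (£)': [900, 400, 1500, 950, 450, 1600],
    'altitude (m)': [2500, 1200, 2000, 2600, 1100, 2100],
})

# Mean profile per cluster -- the table the "highest / average / lowest"
# claims above are read from.
profile = toy.groupby('Cluster')[['price (£)', 'altitude (m)']].mean()
print(profile)
```

On the real data, the same `groupby('Cluster').mean()` table makes each "highest/average/lowest" claim directly verifiable.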

10. Preparing Clustered Dataset for Visualization

In [21]:
# Assigning back the cluster label to hotel_cluster dataset columns
cluster = hotel_cluster.assign(Cluster = cluster_labels3)
cluster['Cluster'] = cluster['Cluster'].astype('category')

# Renaming the cluster
cluster['Cluster'] = cluster['Cluster'].cat.rename_categories({
    0:'The Alpine Haven',
    1:'The Slope Saver',
    2:'The Glacier Elite',
})
In [22]:
# Adding back the dropped variable from the original hotel dataset into cluster dataset
cluster['country'] = hotel['country']
cluster['resort'] = hotel['resort']
cluster['hotel'] = hotel['hotel']
cluster['resort_link'] = hotel['link']

# reposition the columns
cluster = cluster.reindex(columns=['country', 'resort', 'hotel', 'Cluster', 'price (£)',
                                  'altitude (m)', 'totalPiste (km)', 'totalLifts',
                                  'gondolas', 'chairlifts', 'draglifts', 'blues', 'reds',
                                  'blacks','totalRuns', 'decSnowLow2020(cm)',
                                  'decSnowHigh2020(cm)','janSnowLow2020(cm)', 'janSnowHigh2020(cm)',
                                   'febSnowLow2020(cm)','febSnowHigh2020(cm)'])

# merging the snow data for each month
cluster['snow_dec'] = (cluster['decSnowLow2020(cm)'] + cluster['decSnowHigh2020(cm)'])/2
cluster['snow_jan'] = (cluster['janSnowLow2020(cm)'] + cluster['janSnowHigh2020(cm)'])/2
cluster['snow_feb'] = (cluster['febSnowLow2020(cm)'] + cluster['febSnowHigh2020(cm)'])/2
In [23]:
# dropping duplicate variables
cluster.drop(columns=['decSnowLow2020(cm)','decSnowHigh2020(cm)','janSnowLow2020(cm)',
                      'janSnowHigh2020(cm)', 'febSnowLow2020(cm)','febSnowHigh2020(cm)'], inplace=True)

11. Visualizing Clusters: Total Piste vs Altitude

In [24]:
# Compute mean total piste length and mean altitude for each cluster
cluster_means = cluster.groupby('Cluster')[['totalPiste (km)', 'altitude (m)']].mean()
In [25]:
# Plotting scatterplot between Total Piste vs Altitude
plt.figure(figsize=(7.5,4))
plt.xlabel("Total Piste Length (km)")
plt.ylabel("Altitude (m)")
plt.title("Resort Total Piste vs Altitude")
sns.despine()
pl = sns.scatterplot(
    x='totalPiste (km)',
    y='altitude (m)',
    s=100,
    hue='Cluster',
    data=cluster)

# Define centroid colors
cluster_colors = ['steelblue','darkorange','green']

# Add centroid (mean) points and labels for each cluster
for i in range(len(cluster_means)):
    plt.plot(cluster_means.iloc[i]['totalPiste (km)'], cluster_means.iloc[i]['altitude (m)'],
             marker='s', markersize=15, color=cluster_colors[i])

12. Visualizing Clusters: Snow Thickness Trend

In [26]:
sns.catplot(y='Cluster', x='value', col='variable', kind='box',
            data=pd.melt(cluster[['Cluster', 'snow_dec', 'snow_jan', 'snow_feb']], id_vars=['Cluster']));

13. Visualizing Cluster: Number of Runs

In [27]:
# Reshape data using melt
melted_df = cluster[['blues', 'reds', 'blacks', 'Cluster']].melt(id_vars='Cluster', var_name='Variable', value_name='Value')

# Convert Cluster column to categorical
melted_df['Cluster'] = pd.Categorical(melted_df['Cluster'])

# Plot bar chart
sns.barplot(x='Variable', y='Value', hue='Cluster', data=melted_df, ci=None)
plt.title("Number of Runs in Each Resort Cluster");

14. Visualizing Cluster: Prices

In [28]:
plt.figure(figsize=(10, 6))
sns.barplot(x='Cluster', y='price (£)', hue='Cluster', data=cluster, dodge=False, ci=None)
plt.xlabel("Cluster")
plt.ylabel("Price per Night (£)")
plt.title("Cluster vs Price per Night (£)");

15. Cluster Visualization: Cluster GIS The Alpine Haven

In [29]:
# Convert all country names to title case
cluster['country'] = cluster['country'].str.title()
# Create a dictionary of country names and their ISO Alpha-3 codes
iso_codes = {country.name: country.alpha_3 for country in pycountry.countries}
# Replace the country names with their ISO Alpha-3 codes in the 'country' Series
cluster['iso_codes'] = cluster['country'].replace(iso_codes)
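One detail worth noting in the cell above: `Series.replace` leaves any country name without a dictionary entry unchanged, whereas `Series.map` (used later for the choropleth counts) turns it into NaN, which makes misses easy to spot. A small sketch with a hand-made stand-in for the pycountry dictionary shows the difference:

```python
import pandas as pd

# Tiny stand-in for the pycountry-derived iso_codes dict.
iso_codes = {'France': 'FRA', 'Austria': 'AUT', 'Switzerland': 'CHE'}

countries = pd.Series(['France', 'Austria', 'Andorra'])

# .replace leaves names without a dictionary entry untouched ...
replaced = countries.replace(iso_codes)
print(replaced.tolist())  # ['FRA', 'AUT', 'Andorra']

# ... while .map turns them into NaN.
mapped = countries.map(iso_codes)
print(mapped.tolist())
```

Because pycountry's official names do not always match colloquial spellings, checking for NaN after a `.map` is a quick way to find country names that still need manual fixing.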
In [30]:
# Filter the rows that correspond to 'The Alpine Haven' cluster
alpine_haven = cluster[cluster['Cluster'] == 'The Alpine Haven']

# Group the rows by country and count the size of each group
country_counts_ah = alpine_haven.groupby('country').size()

# Reset the index to turn the resulting Series into a DataFrame
country_counts_ah = country_counts_ah.reset_index()

# Map the country names to their ISO Alpha-3 codes
country_counts_ah['iso_codes'] = country_counts_ah['country'].map(iso_codes)

# Rename the columns to 'country', 'count', and 'iso_codes'
country_counts_ah.columns = ['country', 'count', 'iso_codes']

# Create a choropleth map using plotly
fig = px.choropleth(country_counts_ah,
                    locations="iso_codes",
                    color="count",
                    hover_name="country",
                    projection="mercator",
                    title="The Alpine Haven: Country Count"
                   )
fig.show()

16. Cluster Visualization: Cluster GIS The Slope Saver

In [31]:
# Filter the rows that correspond to 'The Slope Saver' cluster
slope_saver = cluster[cluster['Cluster'] == 'The Slope Saver']

# Group the rows by country and count the size of each group
country_counts_ss = slope_saver.groupby('country').size()

# Reset the index to turn the resulting Series into a DataFrame
country_counts_ss = country_counts_ss.reset_index()

# Map the country names to their ISO Alpha-3 codes
country_counts_ss['iso_codes'] = country_counts_ss['country'].map(iso_codes)

# Rename the columns to 'country', 'count', and 'iso_codes'
country_counts_ss.columns = ['country', 'count', 'iso_codes']

# Create a choropleth map using plotly
fig = px.choropleth(country_counts_ss,
                    locations="iso_codes",
                    color="count",
                    hover_name="country",
                    projection="mercator",
                    title="The Slope Saver: Country Count"
                   )
fig.show()

17. Cluster Visualization: Cluster GIS The Glacier Elite

In [32]:
# Filter the rows that correspond to 'The Glacier Elite' cluster
glacier_elite = cluster[cluster['Cluster'] == 'The Glacier Elite']

# Group the rows by country and count the size of each group
country_counts_ge = glacier_elite.groupby('country').size()

# Reset the index to turn the resulting Series into a DataFrame
country_counts_ge = country_counts_ge.reset_index()

# Map the country names to their ISO Alpha-3 codes
country_counts_ge['iso_codes'] = country_counts_ge['country'].map(iso_codes)

# Rename the columns to 'country', 'count', and 'iso_codes'
country_counts_ge.columns = ['country', 'count', 'iso_codes']

# Create a choropleth map using plotly
fig = px.choropleth(country_counts_ge,
                    locations="iso_codes",
                    color="count",
                    hover_name="country",
                    projection="mercator",
                    title="The Glacier Elite: Country Count"
                   )
fig.show()

18. Cluster Visualization: Total Piste vs Total Runs

In [33]:
# Plotting scatterplot between Total Piste vs Total Runs
plt.figure(figsize=(7.5,4))
plt.xlabel("Total Piste Length (km)")
plt.ylabel("Total Runs")
plt.title("Resort Total Piste vs Total Runs")
sns.despine()
pl2 = sns.scatterplot(
    x='totalPiste (km)',
    y='totalRuns',
    s=100,
    hue='Cluster',
    data=cluster)

Part 4 Conjoint Analysis and Memo Section¶


1. Importing Datasets

In [34]:
costs = pd.read_csv("amenity_costs.csv")
amenity = pd.read_csv("hotel_amenities.csv")

2. Checking for NA Values

In [35]:
amenity.isna().any()
Out[35]:
WiFi_Network    False
breakfast       False
parking         False
gym             False
flex_check      False
shuttle_bus     False
air_pure        False
jacuzzi         False
VIP_shop        False
pool_temp       False
avg_rating      False
dtype: bool

3. Checking for Unique Values in Each Variable

In [36]:
for i in amenity.columns[0:10]:
    print(i,amenity[i].unique())
WiFi_Network ['Basic' 'Strong' 'Best in Class']
breakfast ['None' 'Continental' 'Full Buffet']
parking ['Valet' 'Open Lot']
gym ['None' 'Basic' 'Advanced' 'Super']
flex_check ['No' 'Yes']
shuttle_bus ['No' 'Yes']
air_pure ['No' 'Yes']
jacuzzi ['No' 'Yes']
VIP_shop ['No' 'Yes']
pool_temp [76 80 84]

4. Dummifying Variables

In [37]:
amenity2 = pd.get_dummies(amenity, columns=['WiFi_Network', 'breakfast', 'parking', 'gym',
                                            'flex_check', 'shuttle_bus', 'air_pure', 'jacuzzi',
                                            'VIP_shop', 'pool_temp'])
amenity2.head()
Out[37]:
avg_rating WiFi_Network_Basic WiFi_Network_Best in Class WiFi_Network_Strong breakfast_Continental breakfast_Full Buffet breakfast_None parking_Open Lot parking_Valet gym_Advanced gym_Basic gym_None gym_Super flex_check_No flex_check_Yes shuttle_bus_No shuttle_bus_Yes air_pure_No air_pure_Yes jacuzzi_No jacuzzi_Yes VIP_shop_No VIP_shop_Yes pool_temp_76 pool_temp_80 pool_temp_84
0 4.57 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0
1 7.60 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0
2 5.66 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1
3 2.80 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1 1 0 0
4 4.56 1 0 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 0

5. Building Linear Regression Model

In [38]:
X = amenity2[['WiFi_Network_Strong', 'WiFi_Network_Best in Class',
       'breakfast_Continental', 'breakfast_Full Buffet', 'parking_Valet', 'gym_Basic',
       'gym_Advanced', 'gym_Super', 'flex_check_Yes', 'shuttle_bus_Yes',
       'air_pure_Yes', 'jacuzzi_Yes', 'VIP_shop_Yes', 'pool_temp_80',
       'pool_temp_84']]
y = amenity2['avg_rating']
In [39]:
regressor = LinearRegression()
regressor.fit(X, y)
Out[39]:
LinearRegression()

6. Checking the Linear Regression Coefficient on the Amenities

In [40]:
coef_df = pd.DataFrame(regressor.coef_, X.columns, columns = ['Coefficient'])
coef_df
Out[40]:
Coefficient
WiFi_Network_Strong 1.187700
WiFi_Network_Best in Class 1.726814
breakfast_Continental 0.225924
breakfast_Full Buffet 0.726923
parking_Valet 0.093678
gym_Basic 0.022170
gym_Advanced 0.168484
gym_Super 0.212847
flex_check_Yes 0.478220
shuttle_bus_Yes 0.419939
air_pure_Yes 0.075258
jacuzzi_Yes 0.183909
VIP_shop_Yes 0.217925
pool_temp_80 0.074744
pool_temp_84 0.263806
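In conjoint terms, each coefficient above is the part-worth utility of a level relative to its attribute's baseline (the dropped dummy, with utility 0). A common follow-up is relative attribute importance: each attribute's range of part-worths divided by the sum of all ranges. A minimal sketch using the coefficients from the table above, with the baseline levels filled in at 0:

```python
# Part-worths per attribute, including the 0.0 baseline level.
part_worths = {
    'WiFi_Network': [0.0, 1.187700, 1.726814],
    'breakfast':    [0.0, 0.225924, 0.726923],
    'parking':      [0.0, 0.093678],
    'gym':          [0.0, 0.022170, 0.168484, 0.212847],
    'flex_check':   [0.0, 0.478220],
    'shuttle_bus':  [0.0, 0.419939],
    'air_pure':     [0.0, 0.075258],
    'jacuzzi':      [0.0, 0.183909],
    'VIP_shop':     [0.0, 0.217925],
    'pool_temp':    [0.0, 0.074744, 0.263806],
}

# Importance = attribute's part-worth range as a share of all ranges.
ranges = {a: max(v) - min(v) for a, v in part_worths.items()}
total = sum(ranges.values())
importance = {a: 100 * r / total
              for a, r in sorted(ranges.items(), key=lambda kv: kv[1], reverse=True)}

for attr, pct in importance.items():
    print(f'{attr:12s} {pct:5.1f}%')
```

By this measure, WiFi network quality dominates, accounting for roughly 39% of the total utility range.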

7. Checking NA Values in Cost Dataset

In [41]:
costs.isna().any()
Out[41]:
Amenity                                               False
Level                                                 False
Estimated Incremental Cost,\nPer Visitor/Per Night    False
dtype: bool

8. Preparing Cost Data

In [42]:
costs.loc[costs['Amenity'] == 'pool temp', 'Amenity'] = 'pool_temp'
row1 = costs.loc[6]
row2 = costs.loc[7]

# Swap their positions using indexing
costs.loc[6] = row2
costs.loc[7] = row1

9. Splitting Linear Regression Coefficient Dataframe

In [43]:
split = coef_df.index.str.split('_', expand=True)
In [44]:
split_df = split.to_frame()
In [45]:
split_df.index = range(len(split_df))

10. Creating Level Combination Dataframe for Amenities

In [46]:
split_df[['Amenity', 'Level']] = split_df.apply(
    lambda row: (
        (row[0] + '_' + row[1], row[2]) if not pd.isna(row[2]) else (row[0], row[1])
    ),
    axis=1,
    result_type='expand'
)
split_df = split_df.drop(columns=[0, 1, 2])
In [47]:
split_df['Coefficient'] = coef_df.Coefficient.values
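As an aside, the apply-based recombination above (needed because amenity names like `WiFi_Network` themselves contain underscores) can also be done by splitting only on the last underscore. A minimal sketch on labels of the same shape:

```python
import pandas as pd

labels = pd.Index(['WiFi_Network_Strong', 'breakfast_Continental', 'pool_temp_84'])

# rsplit with n=1 splits on the LAST underscore only, so multi-word
# amenity names such as 'WiFi_Network' stay intact.
parts = labels.str.rsplit('_', n=1)
split_df = pd.DataFrame(parts.tolist(), columns=['Amenity', 'Level'])
print(split_df)
```

This yields the same `Amenity`/`Level` pairs in one step.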

11. Merging Cost Data and Level Amenities Combination Data

In [48]:
merged_df = costs.merge(split_df, on=['Amenity', 'Level'], how='outer')
merged_df = merged_df.fillna(0)
merged_df
Out[48]:
Amenity Level Estimated Incremental Cost,\nPer Visitor/Per Night Coefficient
0 WiFi_Network Basic 11.75 0.000000
1 WiFi_Network Strong 16.25 1.187700
2 WiFi_Network Best in Class 19.15 1.726814
3 breakfast None 0.00 0.000000
4 breakfast Continental 13.25 0.225924
5 breakfast Full Buffet 22.45 0.726923
6 parking Open Lot 15.00 0.000000
7 parking Valet 60.00 0.093678
8 gym None 0.00 0.000000
9 gym Basic 10.00 0.022170
10 gym Advanced 35.00 0.168484
11 gym Super 65.00 0.212847
12 flex_check No 0.00 0.000000
13 flex_check Yes 12.00 0.478220
14 shuttle_bus No 0.00 0.000000
15 shuttle_bus Yes 75.00 0.419939
16 air_pure No 0.00 0.000000
17 air_pure Yes 12.85 0.075258
18 jacuzzi No 0.00 0.000000
19 jacuzzi Yes 40.00 0.183909
20 VIP_shop No 0.00 0.000000
21 VIP_shop Yes 12.00 0.217925
22 pool_temp 76 15.00 0.000000
23 pool_temp 80 35.00 0.074744
24 pool_temp 84 45.00 0.263806

12. Creating All Possible Combination of Amenities along with Cost

In [49]:
# Group the data by amenity
grouped = merged_df.groupby("Amenity")

# Create a list of all possible combinations of amenities
combinations = [list(group.index) for _, group in grouped]
all_combinations = list(itertools.product(*combinations))

best_coefficient = 0
best_combination = None

for combination in all_combinations:
    selected_amenities = merged_df.iloc[list(combination)]
    total_cost = selected_amenities["Estimated Incremental Cost,\nPer Visitor/Per Night"].sum()
    total_coefficient = selected_amenities["Coefficient"].sum()

    # Check if the current combination is better and meets the cost constraint
    if total_cost <= 150 and total_coefficient > best_coefficient:
        best_coefficient = total_coefficient
        best_combination = selected_amenities

# Print the best combination
print("Best combination:")
print(best_combination[["Amenity", "Level", "Estimated Incremental Cost,\nPer Visitor/Per Night", "Coefficient"]])

# Print the total cost and total coefficient
print("\nTotal cost:", best_combination["Estimated Incremental Cost,\nPer Visitor/Per Night"].sum())
print("Total coefficient:", best_combination["Coefficient"].sum())
Best combination:
         Amenity          Level  \
21      VIP_shop            Yes   
2   WiFi_Network  Best in Class   
17      air_pure            Yes   
5      breakfast    Full Buffet   
13    flex_check            Yes   
9            gym          Basic   
18       jacuzzi             No   
6        parking       Open Lot   
24     pool_temp             84   
14   shuttle_bus             No   

    Estimated Incremental Cost,\nPer Visitor/Per Night  Coefficient  
21                                              12.00      0.217925  
2                                               19.15      1.726814  
17                                              12.85      0.075258  
5                                               22.45      0.726923  
13                                              12.00      0.478220  
9                                               10.00      0.022170  
18                                               0.00      0.000000  
6                                               15.00      0.000000  
24                                              45.00      0.263806  
14                                               0.00      0.000000  

Total cost: 148.45
Total coefficient: 3.5111168981481504
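The loop above is a brute-force search over the full cross-product of amenity levels under the £150 budget. The same pattern can be captured in a small reusable helper; the attribute names, costs, and utilities below are illustrative stand-ins, not the notebook's actual data:

```python
import itertools

def best_bundle(options, budget):
    """Enumerate every cross-attribute combination and return the
    highest-utility bundle whose total cost fits within the budget."""
    best_combo, best_util = None, float('-inf')
    for combo in itertools.product(*options.values()):
        cost = sum(c for _, c, _ in combo)
        util = sum(u for _, _, u in combo)
        if cost <= budget and util > best_util:
            best_combo, best_util = combo, util
    return best_combo, best_util

# Hypothetical mini-example: two attributes, (level, cost, utility) per option.
options = {
    'wifi': [('Basic', 10, 0.0), ('Best', 20, 1.7)],
    'breakfast': [('None', 0, 0.0), ('Buffet', 25, 0.7)],
}
bundle, utility = best_bundle(options, budget=40)
print(bundle, utility)  # (('Best', 20, 1.7), ('None', 0, 0.0)) 1.7
```

Brute force is fine here because the number of combinations is small (3 × 3 × 2 × 4 × 2 × 2 × 2 × 2 × 2 × 3 = 3,456 in the real dataset); with many more attributes, a knapsack-style formulation would scale better.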

Part 6 Strategic Memo¶


After analyzing the Golden Arch Hotel case study, I have gathered insights on the strengths, weaknesses, opportunities, and threats (SWOT) that McDonald's faced when entering the hotel market. This analysis will help inform recommendations for Lobster Land's potential overseas expansion. SWOT Analysis for Golden Arch Hotel:

Strengths:

  • Strong brand recognition of McDonald's worldwide. (McDonald's is a globally recognized brand that benefits from high levels of customer familiarity and trust)
  • Competitive pricing strategy. (The Golden Arch Hotel offered competitive room rates, making it attractive to price-conscious travelers)
  • Technologically advanced amenities for the time. (The hotel featured modern amenities such as high-speed internet and TV with wireless keyboard, which catered to the demands of its target market.)

Weaknesses:

  • Inconsistent customer experience across various aspects of the hotel stay. (The Golden Arch Hotel struggled to maintain a consistent level of quality across the hotel experience, resulting in mixed customer feedback.)
  • Incongruent brand image between McDonald's and a four-star hotel. (The brand association between McDonald's and a four-star hotel created confusion among customers who expected a more upscale experience)
  • Limited food options and poor ambiance in the bar and restaurant area. (The hotel's restaurant and bar failed to meet customer expectations, as they lacked variety and ambiance, undermining the overall experience.)

Opportunities:

  • Capitalize on the established McDonald's brand to attract customers.
  • Collaboration with airlines, tour operators, and business travelers to increase occupancy rates.
  • Leverage the hotel's proximity to the airport and exhibition area.

Threats:

  • Strong competition from other established hotel chains.
  • Negative customer experiences impacting the hotel's reputation.
  • Potential language and cultural barriers.

The Golden Arch Hotel capitalized on the McDonald's brand, leveraging its strong brand recognition to create awareness and generate interest in the hotel. This strategy allowed them to tap into the existing customer base and brand loyalty of McDonald's, which played a significant role in attracting guests. Additionally, they implemented technologically advanced amenities, such as a TV with wireless keyboard and in-room internet access (still uncommon at the time), catering to the needs of their target market of business travelers. By offering these amenities, they successfully positioned the hotel as a convenient and attractive option for their intended audience.

However, there were areas for improvement. Aligning brand expectations was a crucial aspect that McDonald's should have addressed more effectively. The association between their fast-food brand and a four-star hotel created confusion among customers who expected a more upscale experience. McDonald's could have developed a sub-brand or utilized distinct branding elements to differentiate the Golden Arch Hotel from their fast-food restaurants, allowing them to set more appropriate customer expectations.

Another area for improvement is the enhancement of food and beverage offerings. The hotel's restaurant and bar lacked variety and ambiance, which detracted from the overall guest experience. Offering a wider range of food options, hiring skilled chefs, and investing in creating a more inviting ambiance could have significantly improved customer satisfaction. Incorporating elements of local cuisine and culture into the menu would have also helped the hotel to stand out from competitors and provide guests with a unique experience.

Lastly, ensuring a consistent customer experience is vital in the hotel industry. McDonald's should have focused on delivering a high-quality experience across all aspects of the hotel, such as room cleanliness, staff service, and facilities maintenance. By doing so, they could have fostered customer satisfaction and generated positive word of mouth, which is invaluable in the hospitality sector. To achieve this, they could have invested in comprehensive staff training and regular quality assurance audits to ensure that every aspect of the hotel met or exceeded the expectations of their guests.

By addressing these areas for improvement, the Golden Arch Hotel could have built on the strong foundation established through their effective use of the McDonald's brand and positioning in the market. Ultimately, this would have led to increased customer satisfaction and long-term success in the competitive hotel industry.

Recommendations for Lobster Land

Drawing from the Golden Arch Hotel case study, Lobster Land should focus on establishing a clear brand identity that differentiates it from competitors by emphasizing its unique seaside atmosphere, affordability, and family-friendly experience. Aligning brand expectations is crucial for Lobster Land to avoid confusion that might arise from brand associations. Localization and catering to local preferences is essential as Lobster Land continues to grow, which can be achieved by incorporating local ingredients and traditions, collaborating with local businesses, or creating region-specific attractions.

Ensuring a consistent, high-quality customer experience across all aspects of Lobster Land is vital for customer satisfaction and positive word of mouth. This includes ride operations, staff service, cleanliness, and other facilities. Lobster Land can also benefit from leveraging advanced technology solutions, such as park-issued wristbands for payments and ride access. By collecting and analyzing data from these wristbands, Lobster Land can optimize operations, enhance visitor experiences, and make data-driven decisions for future improvements.

Inspired by the Golden Arch Hotel case, Lobster Land can create seasonal events and promotions to attract visitors during peak times, with themed activities, entertainment, or limited-time food offerings that celebrate local culture and traditions. Demonstrating a commitment to sustainability and community engagement can enhance Lobster Land's reputation and appeal. Initiatives such as waste reduction, energy conservation, and support for local charities or events can help position the park as a responsible and caring member of the community. By implementing these recommendations, Lobster Land can strengthen its brand identity, enhance visitor experiences, and foster long-term success as a popular seaside amusement park.

Part 7 Classification¶


Classification Lobsterland Hotel Satisfaction Data¶

In [50]:
df = pd.read_csv("hotel_satisfaction.csv")
df.head()
Out[50]:
id Gender Age purpose_of_travel Type of Travel Type Of Booking Hotel wifi service Departure/Arrival convenience Ease of Online booking Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness satisfaction
0 70172 Male 13 aviation Personal Travel Not defined 3 4 3 1 5 5 5 4 5 5 neutral or dissatisfied
1 5047 Male 25 tourism Group Travel Group bookings 3 2 3 3 1 1 1 1 4 1 neutral or dissatisfied
2 110028 Female 26 tourism Group Travel Group bookings 2 2 2 2 5 5 5 4 4 5 satisfied
3 24026 Female 25 tourism Group Travel Group bookings 2 5 5 5 2 2 2 1 4 2 neutral or dissatisfied
4 119299 Male 61 aviation Group Travel Group bookings 3 3 3 3 4 5 3 3 3 3 satisfied
In [51]:
df.shape
Out[51]:
(103904, 17)
In [52]:
list(df.columns)
Out[52]:
['id',
 'Gender',
 'Age',
 'purpose_of_travel',
 'Type of Travel',
 'Type Of Booking',
 'Hotel wifi service',
 'Departure/Arrival  convenience',
 'Ease of Online booking',
 'Hotel location',
 'Food and drink',
 'Stay comfort',
 'Common Room entertainment',
 'Checkin/Checkout service',
 'Other service',
 'Cleanliness',
 'satisfaction']

Categorical and Numerical Variables

Categorical Variables are:

id, Gender, purpose_of_travel, Type of Travel, Type Of Booking, satisfaction

Numerical Variables are:

Age, Hotel wifi service, Departure/Arrival convenience, Ease of Online booking, Hotel location, Food and drink, Stay comfort, Common Room entertainment, Checkin/Checkout service, Other service, Cleanliness

Values of target variable 'satisfaction'

In [53]:
df['satisfaction'].value_counts()
Out[53]:
neutral or dissatisfied    58879
satisfied                  45025
Name: satisfaction, dtype: int64

We do not have a significant class imbalance in the target variable.
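Given 58,879 "neutral or dissatisfied" responses versus 45,025 "satisfied", the majority-class baseline accuracy, which any useful classifier should beat, works out to about 56.7%:

```python
# Class counts from the value_counts() output above.
counts = {'neutral or dissatisfied': 58879, 'satisfied': 45025}

# Accuracy of a trivial model that always predicts the majority class.
baseline = max(counts.values()) / sum(counts.values())
print(f'Majority-class baseline accuracy: {baseline:.3f}')  # 0.567
```

This gives a concrete floor for judging the logistic regression model fitted below.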

Checking Correlations Among the Numerical Variables

In [54]:
nums = df[['Age',
 'Hotel wifi service',
 'Departure/Arrival  convenience',
 'Ease of Online booking',
 'Hotel location',
 'Food and drink',
 'Stay comfort',
 'Common Room entertainment',
 'Checkin/Checkout service',
 'Other service',
 'Cleanliness']]

cor_table = nums.corr()
In [55]:
plt.figure(figsize=(20, 12 ))
heatmap = sns.heatmap(cor_table, vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':20}, pad=20);

We examined the correlations among the numerical variables to check for multicollinearity, using a threshold of ±0.7 to flag variable pairs as multicollinear.

We identified the correlation between "Ease of Online booking" and "Hotel wifi service" as problematic: a strong positive correlation of 0.72. We therefore decided to remove 'Ease of Online booking'.

This makes sense, as the hotel wifi service contributes to the ease of online booking; the two variables largely describe the same thing.
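The ±0.7 screening rule can also be applied programmatically by scanning the upper triangle of the correlation matrix. A minimal sketch with a toy matrix standing in for `cor_table`:

```python
import pandas as pd

# Toy correlation matrix standing in for cor_table.
cor_table = pd.DataFrame(
    [[1.00, 0.72, 0.10],
     [0.72, 1.00, 0.05],
     [0.10, 0.05, 1.00]],
    index=['Hotel wifi service', 'Ease of Online booking', 'Hotel location'],
    columns=['Hotel wifi service', 'Ease of Online booking', 'Hotel location'])

threshold = 0.7
flagged = [
    (a, b, cor_table.loc[a, b])
    for i, a in enumerate(cor_table.index)
    for b in cor_table.columns[i + 1:]          # upper triangle only
    if abs(cor_table.loc[a, b]) >= threshold
]
print(flagged)
```

Each flagged pair is then a candidate for dropping one of its two variables, as done with 'Ease of Online booking' here.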

Removing ID and 'Ease of Online booking'

In [56]:
df2 = df.drop(['id', 'Ease of Online booking'], axis=1)

Convert Categorical Variables into numerical columns

In [57]:
cat_cols = ['Gender', 'purpose_of_travel', 'Type of Travel', 'Type Of Booking', 'satisfaction']

df_dum = pd.get_dummies(df2, drop_first=True,
                        columns= cat_cols)

Data Partition

In [58]:
from sklearn.model_selection import train_test_split

X = df_dum.loc[:, df_dum.columns != 'satisfaction_satisfied']
y = df_dum['satisfaction_satisfied']

# test_size=0.4 gives a 60/40 train/test split; random_state makes it reproducible
# (sklearn ignores random.seed, so random_state is what controls the split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=3)
In [59]:
print(X_train.shape, X_test.shape, type(X_train), type(X_test))
print(y_train.shape, y_test.shape, type(y_train), type(y_test))
(62342, 18) (41562, 18) <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
(62342,) (41562,) <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>

Identifying Mean Values to Check for First Hints on What Might Impact Satisfaction

In [60]:
# Mean of each variable within each satisfaction outcome
sat_means = df_dum.groupby('satisfaction_satisfied').mean()
sat_means
Out[60]:
Age Hotel wifi service Departure/Arrival convenience Hotel location Food and drink Stay comfort Common Room entertainment Checkin/Checkout service Other service Cleanliness Gender_Male purpose_of_travel_aviation purpose_of_travel_business purpose_of_travel_personal purpose_of_travel_tourism Type of Travel_Personal Travel Type Of Booking_Individual/Couple Type Of Booking_Not defined
satisfaction_satisfied
0 37.566688 2.399633 3.129112 2.976121 2.95805 3.036295 2.894156 3.042952 3.388814 2.936123 0.487203 0.134819 0.203519 0.091425 0.307461 0.492026 0.646139 0.095960
1 41.750583 3.161288 2.970305 2.977879 3.52131 3.966530 3.964931 3.646041 3.969461 3.744342 0.499522 0.131216 0.205552 0.092504 0.309828 0.072826 0.193248 0.040955

Variables that might be impactful, based on a simple mean comparison between the outcome classes, are:

  • Common Room Entertainment
  • Cleanliness
  • Food and Drink

Building our Predictive Model: Logistic Regression Model

First Iteration

In [61]:
logit_model = sm.Logit(y_train, sm.add_constant(X_train))

result = logit_model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: inf
         Iterations 7
                             Logit Regression Results                             
==================================================================================
Dep. Variable:     satisfaction_satisfied   No. Observations:                62342
Model:                              Logit   Df Residuals:                    62323
Method:                               MLE   Df Model:                           18
Date:                    Wed, 10 Jan 2024   Pseudo R-squ.:                     inf
Time:                            00:11:00   Log-Likelihood:                   -inf
converged:                           True   LL-Null:                        0.0000
Covariance Type:                nonrobust   LLR p-value:                     1.000
=====================================================================================================
                                        coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const                                -4.6981      0.079    -59.709      0.000      -4.852      -4.544
Age                                   0.0137      0.001     17.028      0.000       0.012       0.015
Hotel wifi service                    0.5716      0.010     56.119      0.000       0.552       0.592
Departure/Arrival  convenience       -0.0556      0.009     -5.879      0.000      -0.074      -0.037
Hotel location                       -0.1748      0.010    -16.872      0.000      -0.195      -0.155
Food and drink                       -0.1362      0.012    -11.068      0.000      -0.160      -0.112
Stay comfort                          0.2504      0.013     19.818      0.000       0.226       0.275
Common Room entertainment             0.3701      0.015     24.575      0.000       0.341       0.400
Checkin/Checkout service              0.3398      0.010     35.176      0.000       0.321       0.359
Other service                         0.1936      0.012     16.281      0.000       0.170       0.217
Cleanliness                           0.1716      0.014     12.405      0.000       0.145       0.199
Gender_Male                           0.0394      0.023      1.740      0.082      -0.005       0.084
purpose_of_travel_aviation            0.0043      0.038      0.112      0.911      -0.070       0.079
purpose_of_travel_business           -0.0307      0.033     -0.921      0.357      -0.096       0.035
purpose_of_travel_personal           -0.0058      0.043     -0.135      0.893      -0.091       0.079
purpose_of_travel_tourism            -0.0389      0.030     -1.294      0.196      -0.098       0.020
Type of Travel_Personal Travel       -1.8809      0.035    -53.810      0.000      -1.949      -1.812
Type Of Booking_Individual/Couple    -1.4889      0.027    -54.378      0.000      -1.543      -1.435
Type Of Booking_Not defined          -1.3149      0.048    -27.354      0.000      -1.409      -1.221
=====================================================================================================

Based on this regression output, we can identify variables with low predictive value. The p-values show that 'purpose of travel' is insignificant across all of its levels, and Gender is also not significant (p = 0.082). We will therefore remove these variables and run the regression again.
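The same pruning can be scripted from the fitted model's p-values (on a statsmodels result, `result.pvalues` is a pandas Series). Illustrated here with the p-values copied from the summary above:

```python
# p-values copied from the first-iteration Logit summary.
pvalues = {
    'Gender_Male': 0.082,
    'purpose_of_travel_aviation': 0.911,
    'purpose_of_travel_business': 0.357,
    'purpose_of_travel_personal': 0.893,
    'purpose_of_travel_tourism': 0.196,
    'Hotel wifi service': 0.000,
    'Type of Travel_Personal Travel': 0.000,
}

alpha = 0.05
# Variables whose p-value exceeds the significance level get dropped.
to_drop = sorted(name for name, p in pvalues.items() if p > alpha)
print(to_drop)
```

The resulting list matches the five columns dropped by hand in the next cell.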

Iteration #2

In [62]:
X_train2 = X_train.drop(['purpose_of_travel_aviation', 'purpose_of_travel_business',
                         'purpose_of_travel_personal', 'purpose_of_travel_tourism',
                         'Gender_Male'], axis=1)
X_train2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 62342 entries, 4830 to 71530
Data columns (total 13 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   Age                                62342 non-null  int64
 1   Hotel wifi service                 62342 non-null  int64
 2   Departure/Arrival  convenience     62342 non-null  int64
 3   Hotel location                     62342 non-null  int64
 4   Food and drink                     62342 non-null  int64
 5   Stay comfort                       62342 non-null  int64
 6   Common Room entertainment          62342 non-null  int64
 7   Checkin/Checkout service           62342 non-null  int64
 8   Other service                      62342 non-null  int64
 9   Cleanliness                        62342 non-null  int64
 10  Type of Travel_Personal Travel     62342 non-null  uint8
 11  Type Of Booking_Individual/Couple  62342 non-null  uint8
 12  Type Of Booking_Not defined        62342 non-null  uint8
dtypes: int64(10), uint8(3)
memory usage: 5.4 MB
In [63]:
logit_model2 = sm.Logit(y_train, sm.add_constant(X_train2))

result2 = logit_model2.fit()
print(result2.summary())
Optimization terminated successfully.
         Current function value: inf
         Iterations 7
                             Logit Regression Results                             
==================================================================================
Dep. Variable:     satisfaction_satisfied   No. Observations:                62342
Model:                              Logit   Df Residuals:                    62328
Method:                               MLE   Df Model:                           13
Date:                    Wed, 10 Jan 2024   Pseudo R-squ.:                     inf
Time:                            00:11:00   Log-Likelihood:                   -inf
converged:                           True   LL-Null:                        0.0000
Covariance Type:                nonrobust   LLR p-value:                     1.000
=====================================================================================================
                                        coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
const                                -4.6992      0.076    -62.112      0.000      -4.848      -4.551
Age                                   0.0137      0.001     17.035      0.000       0.012       0.015
Hotel wifi service                    0.5717      0.010     56.136      0.000       0.552       0.592
Departure/Arrival  convenience       -0.0556      0.009     -5.876      0.000      -0.074      -0.037
Hotel location                       -0.1748      0.010    -16.869      0.000      -0.195      -0.154
Food and drink                       -0.1355      0.012    -11.027      0.000      -0.160      -0.111
Stay comfort                          0.2501      0.013     19.793      0.000       0.225       0.275
Common Room entertainment             0.3693      0.015     24.541      0.000       0.340       0.399
Checkin/Checkout service              0.3398      0.010     35.178      0.000       0.321       0.359
Other service                         0.1937      0.012     16.291      0.000       0.170       0.217
Cleanliness                           0.1725      0.014     12.481      0.000       0.145       0.200
Type of Travel_Personal Travel       -1.8802      0.035    -53.791      0.000      -1.949      -1.812
Type Of Booking_Individual/Couple    -1.4892      0.027    -54.392      0.000      -1.543      -1.436
Type Of Booking_Not defined          -1.3169      0.048    -27.403      0.000      -1.411      -1.223
=====================================================================================================

Check model: dropping Type of Travel and Type of Booking as well, to gauge their impact

In [64]:
X_train3 = X_train.drop(['purpose_of_travel_aviation',	'purpose_of_travel_business',	'purpose_of_travel_personal',	'purpose_of_travel_tourism', 'Gender_Male', 'Type of Travel_Personal Travel', 'Type Of Booking_Individual/Couple', 'Type Of Booking_Not defined'], axis=1)
logit_model3 = sm.Logit(y_train, sm.add_constant(X_train3))

result3 = logit_model3.fit()
print(result3.summary())
Optimization terminated successfully.
         Current function value: inf
         Iterations 6
                             Logit Regression Results                             
==================================================================================
Dep. Variable:     satisfaction_satisfied   No. Observations:                62342
Model:                              Logit   Df Residuals:                    62331
Method:                               MLE   Df Model:                           10
Date:                    Wed, 10 Jan 2024   Pseudo R-squ.:                     inf
Time:                            00:11:00   Log-Likelihood:                   -inf
converged:                           True   LL-Null:                        0.0000
Covariance Type:                nonrobust   LLR p-value:                     1.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                             -5.9751      0.066    -90.988      0.000      -6.104      -5.846
Age                                0.0173      0.001     26.030      0.000       0.016       0.019
Hotel wifi service                 0.6017      0.009     67.096      0.000       0.584       0.619
Departure/Arrival  convenience    -0.3234      0.008    -41.899      0.000      -0.339      -0.308
Hotel location                    -0.0515      0.009     -5.824      0.000      -0.069      -0.034
Food and drink                    -0.1813      0.011    -16.767      0.000      -0.202      -0.160
Stay comfort                       0.3565      0.011     33.132      0.000       0.335       0.378
Common Room entertainment          0.4031      0.013     30.622      0.000       0.377       0.429
Checkin/Checkout service           0.3500      0.008     41.396      0.000       0.333       0.367
Other service                      0.2476      0.010     23.895      0.000       0.227       0.268
Cleanliness                        0.0921      0.012      7.542      0.000       0.068       0.116
==================================================================================================

Evaluation¶

In [65]:
#train on logmodel scikit
logmodel= LogisticRegression()
logmodel.fit(X_train2, y_train)

#adjust Xtest for evaluation of logistic regression
X_test2 = X_test.drop(['purpose_of_travel_aviation',	'purpose_of_travel_business',	'purpose_of_travel_personal',	'purpose_of_travel_tourism', 'Gender_Male'], axis=1)

#predict
predictions_train = logmodel.predict(X_train2)

predictions_test = logmodel.predict(X_test2)

Confusion Matrix for Test Set

In [66]:
cm = confusion_matrix(predictions_test, y_test)
sns.heatmap(cm, square= True, fmt = 'g', annot=True, cbar= False)
plt.xlabel("actual result")
plt.ylabel("predicted result")
a,b = plt.ylim()
a+=.5
b-=.5
plt.ylim(a,b)
plt.show()
In [67]:
tn, fp, fn, tp = cm.ravel()
#
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
balanced_accuracy = (sensitivity + specificity) / 2

print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)
print("Balanced Accuracy:", balanced_accuracy)
Accuracy: 0.8405995861604351
Sensitivity: 0.8180702732849973
Specificity: 0.8576929587000677
Precision: 0.8134878819810326
Balanced Accuracy: 0.8378816159925324
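The hand-computed metrics above can be cross-checked against scikit-learn's built-ins (`accuracy_score`, `confusion_matrix`, and `classification_report` are already imported at the top of the notebook). Note that the cells here call `confusion_matrix(predictions, y)` and label the axes to match; sklearn's documented argument order is `(y_true, y_pred)`, giving rows = actual and columns = predicted. A sketch on hypothetical toy labels:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels standing in for y_test / predictions_test
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# sklearn convention: confusion_matrix(y_true, y_pred)
# rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision/recall/F1 in one call
print(classification_report(y_true, y_pred))
```

`classification_report` reproduces precision and sensitivity (recall for class 1) per class, which makes it a quick sanity check on the manual arithmetic.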

Confusion Matrix for Train Set

In [68]:
cm2 = confusion_matrix(predictions_train, y_train)
sns.heatmap(cm2, square= True, fmt = 'g', annot=True, cbar= False)
plt.xlabel("actual result")
plt.ylabel("predicted result")
a,b = plt.ylim()
a+=.5
b-=.5
plt.ylim(a,b)
plt.show()
In [69]:
tn, fp, fn, tp = cm2.ravel()
#
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
balanced_accuracy = (sensitivity + specificity) / 2

print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)
print("Balanced Accuracy:", balanced_accuracy)
Accuracy: 0.8418401719547015
Sensitivity: 0.8171318575553417
Specificity: 0.860733695652174
Precision: 0.8177372749499889
Balanced Accuracy: 0.8389327766037578

There is a slight decrease in accuracy from training (84.18%) to test (84.06%), but the gap is small: the model generalizes well, and the metrics are high across the board.

Additional Test Set Stats for Checked Model:

In [70]:
#train on logmodel scikit
logmodel= LogisticRegression()
logmodel.fit(X_train3, y_train)

#adjust Xtest for evaluation of logistic regression
X_test3 = X_test.drop(['purpose_of_travel_aviation',	'purpose_of_travel_business',	'purpose_of_travel_personal',	'purpose_of_travel_tourism', 'Gender_Male', 'Type of Travel_Personal Travel', 'Type Of Booking_Individual/Couple', 'Type Of Booking_Not defined'], axis=1)

#predict
predictions_train = logmodel.predict(X_train3)
predictions_test = logmodel.predict(X_test3)
In [71]:
cm2 = confusion_matrix(predictions_train, y_train)
sns.heatmap(cm2, square= True, fmt = 'g', annot=True, cbar= False)
plt.xlabel("actual result")
plt.ylabel("predicted result")
a,b = plt.ylim()
a+=.5
b-=.5
plt.ylim(a,b)
plt.show()
In [72]:
tn, fp, fn, tp = cm2.ravel()
#
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
balanced_accuracy = (sensitivity + specificity) / 2

print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Precision:", precision)
print("Balanced Accuracy:", balanced_accuracy)
Accuracy: 0.7644445157357801
Sensitivity: 0.7332575326890278
Specificity: 0.7873293100091776
Precision: 0.7167148255167816
Balanced Accuracy: 0.7602934213491027

This model performs noticeably worse than the model above (train accuracy ~76.4% vs. ~84.2%), confirming that Type of Travel and Type of Booking carry substantial predictive value.

Interpretation of the Logistic Regression Model and its meaning:

To give conference attendees a better understanding of what influences their guests' satisfaction, we can look at the coefficients of the logistic regression constructed above and draw conclusions about the impact of individual variables on satisfaction.

Based on the coefficients of the model (iteration #2), Hotel Wifi Service is one of the most impactful drivers of customer satisfaction. A marketing initiative geared toward this preference would be to install, and then advertise, top-tier internet connectivity. A badge on the booking page such as "500 Mb/s is our standard" might convince guests who are looking for a connected stay.

Three variables related to convenience inside the hotel that strongly and positively impact satisfaction are common room entertainment, check-in/check-out service, and stay comfort. To leverage the first, a hotel could advertise paid-for channels (e.g., "our rooms have Netflix/HBO on demand"). Based on the second, a hotel might analyze and streamline its check-in and check-out processes.

The type of travel can have a significantly negative impact on the travel experience: personal travel and individual/couple bookings carry large negative coefficients. It would be worthwhile for hotel management to analyze the preferences of the different booking groups and tailor offerings around them. Another variable hotels should watch when trying to please customers is departure/arrival convenience: for a city-based hotel, clear instructions on how to find the hotel and where to park might help with satisfaction, and obstacle-free movement of luggage is a second thing to look at.
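Logit coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to communicate to a non-technical audience. A sketch using a few coefficient values hand-copied from the iteration #2 summary above (treat the numbers as illustrative):

```python
import numpy as np
import pandas as pd

# Selected coefficients copied from the iteration #2 logit summary
coefs = pd.Series({
    'Hotel wifi service': 0.5717,
    'Stay comfort': 0.2501,
    'Common Room entertainment': 0.3693,
    'Type of Travel_Personal Travel': -1.8802,
})

# exp(beta): multiplicative change in the odds of satisfaction
# for a one-unit increase in the predictor, holding others fixed
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

Read this way, each one-point increase in the wifi rating multiplies the odds of satisfaction by roughly 1.77, while a personal-travel booking multiplies them by roughly 0.15 (an 85% reduction in the odds).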

Part 8 Statistical Testing¶

In [73]:
st= pd.read_csv("promo_pics.csv")
st.head()
Out[73]:
recipient pic_seen site_duration spend register
0 1 Sunset 18.20 16.60 0
1 2 Main St 28.61 15.30 0
2 3 Waterslide 10.90 16.32 1
3 4 Waterslide 11.30 22.62 0
4 5 Sunset 19.70 17.30 0
In [74]:
st.describe()
Out[74]:
recipient site_duration spend register
count 3400.000000 3400.000000 3400.000000 3400.000000
mean 1700.500000 19.381494 16.461041 0.376176
std 981.639785 6.559998 2.707941 0.484496
min 1.000000 9.300000 7.400000 0.000000
25% 850.750000 11.300000 14.900000 0.000000
50% 1700.500000 21.500000 16.600000 0.000000
75% 2550.250000 24.400000 17.820000 1.000000
max 3400.000000 38.510000 27.120000 1.000000

To perform A/B testing between the three options, we will use t-tests for the continuous outcomes (site duration and spend) and a binomial z-test for the registration proportion, running A vs. B, A vs. C, and B vs. C comparisons to see whether the groups differ significantly.

A: Sunset

B: Main St.

C: Waterslide

Visual Inspection

In [75]:
sns.histplot(data=st, x='site_duration', hue='pic_seen', kde=False, bins=50)
plt.show()
In [76]:
sns.histplot(data=st, x='spend', hue='pic_seen', kde=False, bins=50)
plt.show()

There seem to be differences in mean between all three picture options for both the spend and the site_duration variable.
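Before relying on t-tests, the roughly normal shape suggested by these histograms can be checked formally with the Shapiro–Wilk test (`shapiro` is already imported at the top of the notebook). A self-contained sketch on synthetic data; in the notebook one would pass e.g. `st['spend']` or a per-picture subset instead of the simulated sample:

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-in for st['spend'] (mean/sd chosen to resemble it)
rng = np.random.default_rng(42)
sample = rng.normal(loc=16.5, scale=2.7, size=500)

# Null hypothesis: the sample comes from a normal distribution;
# p > 0.05 means we fail to reject normality
stat, p = shapiro(sample)
print("W:", round(stat, 4), "p:", round(p, 4))
```

Welch's t-test is fairly robust to mild departures from normality at these sample sizes, so this check is a confirmation rather than a strict prerequisite.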

A/B Testing

In [77]:
from scipy import stats
t, p = stats.ttest_ind(st.loc[st['pic_seen'] == "Sunset", 'spend'].values,st.loc[st['pic_seen'] == "Main St", 'spend'].values, equal_var=False)

print("T-stat: ", t)
print("P-Value: ", p)
T-stat:  41.02687256136666
P-Value:  1.7008147207546044e-248

This t-test indicates that, at the 95% confidence level, there is a statistically significant difference in spending between attendees who see the Sunset picture and those who see the Main St. picture (p ≈ 1.7e-248).

In [78]:
sunset_spend = st[st['pic_seen'] == 'Sunset']['spend']
main_spend = st[st['pic_seen'] == 'Main St']['spend']
water_spend = st[st['pic_seen'] == 'Waterslide']['spend']
In [79]:
t_test_BC = stats.ttest_ind(main_spend, water_spend)
print("Main St to Waterslide", t_test_BC)

t_test_AB = stats.ttest_ind(sunset_spend, main_spend)
print("Sunset to Main St", t_test_AB)

t_test_AC = stats.ttest_ind(sunset_spend, water_spend)
print("Sunset to Waterslide", t_test_AC)
Main St to Waterslide Ttest_indResult(statistic=-48.202035799343925, pvalue=0.0)
Sunset to Main St Ttest_indResult(statistic=40.53370629078891, pvalue=2.4333914152341534e-270)
Sunset to Waterslide Ttest_indResult(statistic=-23.596074151014385, pvalue=3.5328205740249146e-110)

For the spend variable, all three pairwise p-values fall below 0.05, so spending differs significantly between every pair of pictures. The Waterslide picture is associated with the highest mean spend.
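The three pairwise tests above can be generated in one loop with `itertools.combinations` (imported at the top of the notebook), which avoids the label typos that creep in with copy-pasted print statements. A sketch on synthetic stand-in samples; in the notebook, `groups` would instead map each `pic_seen` level to its `spend` series, e.g. `st.loc[st['pic_seen'] == name, 'spend']`:

```python
import itertools
import numpy as np
from scipy import stats

# Synthetic stand-ins for the three per-picture spend samples
rng = np.random.default_rng(1)
groups = {
    'Sunset': rng.normal(16.8, 2.7, 1110),
    'Main St': rng.normal(14.0, 2.7, 1148),
    'Waterslide': rng.normal(18.6, 2.7, 1142),
}

results = {}
for a, b in itertools.combinations(groups, 2):
    # Welch's t-test (equal_var=False): no equal-variance assumption
    t, p = stats.ttest_ind(groups[a], groups[b], equal_var=False)
    results[(a, b)] = p
    print(f"{a} vs {b}: t={t:.2f}, p={p:.3g}")
```

The same loop works for `site_duration` by swapping the column, so both blocks of pairwise tests collapse into one reusable pattern.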

In [80]:
sunset_site = st[st['pic_seen'] == 'Sunset']['site_duration']
main_site = st[st['pic_seen'] == 'Main St']['site_duration']
water_site = st[st['pic_seen'] == 'Waterslide']['site_duration']
In [81]:
t_test_BC = stats.ttest_ind(main_site, water_site)
print("Main St to Waterslide", t_test_BC)

t_test_AB = stats.ttest_ind(sunset_site, main_site)
print("Sunset to Main St", t_test_AB)

t_test_AC = stats.ttest_ind(sunset_site, water_site)
print("Sunset to Waterslide", t_test_AC)
Main St to Waterslide Ttest_indResult(statistic=112.12607762440105, pvalue=0.0)
Sunset to Main St Ttest_indResult(statistic=-9.911415184705456, pvalue=1.0724418388111107e-22)
Sunset to Waterslide Ttest_indResult(statistic=180.71977956193996, pvalue=0.0)

When recipients see the Waterslide picture in the email, their stay on the site is significantly shorter than when they see the Main St. or Sunset picture.

Binomial Z-Test for 'register' variable

In [82]:
st['pic_seen'].value_counts()
Out[82]:
Main St       1148
Waterslide    1142
Sunset        1110
Name: pic_seen, dtype: int64
In [83]:
st['register'].value_counts(normalize=True)
Out[83]:
0    0.623824
1    0.376176
Name: register, dtype: float64
In [84]:
st[['pic_seen', 'register']].value_counts()
Out[84]:
pic_seen    register
Main St     0           756
Waterslide  0           744
Sunset      0           621
            1           489
Waterslide  1           398
Main St     1           392
dtype: int64
In [85]:
prop_compare = st.groupby('pic_seen')[['register']].mean()
prop_compare
Out[85]:
register
pic_seen
Main St 0.341463
Sunset 0.440541
Waterslide 0.348511
In [86]:
pic_spend = st.groupby('pic_seen')[['spend']].mean()
pic_spend
Out[86]:
spend
pic_seen
Main St 14.016289
Sunset 16.781892
Waterslide 18.606778
In [87]:
from statsmodels.stats.proportion import proportions_ztest

table = pd.crosstab(st['pic_seen'].isin(['Waterslide', 'Main St']), st['register'])
stat, pval = proportions_ztest(table.iloc[:, 1], table.sum(axis=1))

print("Test statistic:", stat)
print("p-value:", pval)
Test statistic: 5.393864291006912
p-value: 6.895830810111409e-08
In [88]:
table = pd.crosstab(st['pic_seen'].isin(['Waterslide', 'Sunset']), st['register'])

stat, pval = proportions_ztest(table.iloc[:, 1], table.sum(axis=1))

print("Test statistic:", stat)
print("p-value:", pval)
Test statistic: -2.9832686777741704
p-value: 0.0028518752251626817
In [89]:
table = pd.crosstab(st['pic_seen'].isin(['Sunset', 'Main St']), st['register'])

stat, pval = proportions_ztest(table.iloc[:, 1], table.sum(axis=1))

print("Test statistic:", stat)
print("p-value:", pval)
Test statistic: -2.368186457800072
p-value: 0.017875524587415727
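Note that `st['pic_seen'].isin([...])` in the cells above splits the data into one picture versus the other two pooled, rather than comparing two pictures head-to-head. A strictly pairwise comparison can be written by passing each pair's registration counts and sample sizes directly to `proportions_ztest`. A sketch using the counts reported in Out[82] and Out[84] above:

```python
import itertools
from statsmodels.stats.proportion import proportions_ztest

# (registrations, sample size) per picture, from Out[84] / Out[82]
counts = {'Sunset': (489, 1110), 'Main St': (392, 1148), 'Waterslide': (398, 1142)}

results = {}
for a, b in itertools.combinations(counts, 2):
    stat, p = proportions_ztest(
        [counts[a][0], counts[b][0]],   # successes (registrations)
        [counts[a][1], counts[b][1]])   # observations per group
    results[(a, b)] = p
    print(f"{a} vs {b}: z={stat:.2f}, p={p:.4f}")
```

On these counts the Main St vs. Waterslide pair is close (34.1% vs. 34.9%), so the strictly pairwise test does not reach significance there; the headline result, that Sunset registers significantly more recipients than either alternative, is unaffected.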

For the statistical testing portion of this project, we conducted an A/B test on the 'promo_pics' dataset. After importing the data, we inspected the distributions of the two continuous outcomes, spend (amount spent) and site_duration (time on the website). Both appear approximately normal, which supports the reliability of the t-tests. We then ran pairwise t-tests on spend across the three pictures, using a p-value threshold of 0.05 to judge whether two groups have significantly different means. All three tests returned p-values well below 0.05 — Main St vs. Waterslide (≈0), Sunset vs. Main St (2.43e-270), and Sunset vs. Waterslide (3.53e-110) — so we can conclude the differences did not happen coincidentally. If Lobster Land wants to increase spend, they should show the Waterslide picture in the email.

The second set of t-tests, on site duration, likewise returned p-values of approximately 0 for Main St vs. Waterslide and Sunset vs. Waterslide, and 1.07e-22 for Sunset vs. Main St. This tells us the mean differences in site duration are not coincidental, and that the picture impacts the recipient's behavior. To achieve the longest site duration, Lobster Land should show the Main St. picture.

The next step was determining which image drives the highest registration rate. The group means were 34.1% for Main St, 44.1% for Sunset, and 34.9% for Waterslide. To check whether the association between the picture and registering is statistically significant, we conducted a binomial z-test for each picture against the other two, and all p-values came in below 0.05. We can therefore reject the null hypothesis that registration rates do not differ across photos, with Sunset standing out as the most influential. Lobster Land should use the Sunset photo to increase the number of people who register for the conference.

Part 9 Conclusion¶

In conclusion, the insights gathered from the conjoint analysis, segmentation, and targeting, together with the recommendations above, all serve to make Lobster Land a more appealing and successful seaside amusement park. By providing the right amenities at the best value per marginal cost, offering services that scale between peak and off-seasons, and ensuring a consistent, high-quality customer experience, Lobster Land can effectively cater to its target audience while maintaining a strong brand identity and emphasizing its unique selling proposition: a distinctive seaside atmosphere, affordability, and family-friendly experiences.

Furthermore, it is essential for Lobster Land to keep a customer-centric approach and leverage data to make informed decisions about future improvements. By creating seasonal events, promoting sustainability, and engaging with the local community, Lobster Land can enhance its reputation and foster long-term success. These findings and recommendations not only stand to improve Lobster Land's operations and visitor experience, but also illustrate the broader process of analyzing and strategizing for an amusement park business. By applying these insights, Lobster Land can continue to grow and thrive as a popular destination for both local and international guests.